dplyrPivot tables are powerful tools in Excel for summarizing data in
different ways. We will create these tables using the
group_by and summarize functions from the
dplyr package (part of the Tidyverse). We will also learn
how to format tables and practice creating a reproducible report using
RMarkdown and sharing it with GitHub.
Data used in the synthesis section:
In R, we can use the dplyr package for pivot tables by
using 2 functions group_by and summarize
together with the pipe operator %>%. We will also
continue to emphasize reproducibility in all our analyses.
group_by() %>% summarize() from the
dplyr packagemutate() and select() to work
column-wisedplyr website:
dplyr.tidyverse.orgWikipedia describes a pivot table as a “table of statistics that summarizes the data of a more extensive table…this summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.”
Aside: Wikipedia also says that “Although pivot table is a generic term, Microsoft trademarked PivotTable in the United States in 1994.”
Pivot tables are a really powerful tool for summarizing data, and we can have similar functionality in R — as well as nicely automating and reporting these tables.
We will first have a look at our data, demo using pivot tables in Excel, and then create reproducible tables in R.
When reading in Excel files (or really any data that isn’t yours), it can be a good idea to open the data and look at it so you know what you’re up against.
Let’s open the lobsters.xlsx data in Excel.
It’s one sheet, and it’s rectangular. In this data set, every row is a unique observation. This is called “uncounted” data; you’ll see there is no row for how many lobsters were seen because each row is an observation, or an “n of 1”.
But also notice that the data doesn’t start until line 5; there are 4 lines of metadata — data about the data that is super important! — that we don’t want to muddy our analyses.
Now your first idea might be to delete these 4 rows from this Excel sheet and save them on another, but we also know that we need to keep the raw data raw. So let’s not touch this data in Excel, we’ll remove these lines in R. Let’s do that first so then we’ll be all set.
Let’s start a new RMarkdown file in our repo, at the top-level (where
it will be created by default in our Project). I’ll call mine
pivot_lobsters.Rmd.
In the setup chunk, let’s attach our libraries and read in our
lobster data. In addition to the tidyverse package we will
also use the skimr package. You will have to install it,
but don’t want it to be installed every time you write your code. The
following is a nice convention for having the install instructions
available (on the same line) as the library() call.
## attach libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.3
library(here)
## Warning: package 'here' was built under R version 4.3.3
library(skimr) # install.packages('skimr')
## Warning: package 'skimr' was built under R version 4.3.3
library(kableExtra) # install.packages('kableExtra')
## Warning: package 'kableExtra' was built under R version 4.3.3
We used the read_excel() before, which is the generic
function that reads both .xls and .xlsx files. Since we know that this
is a .xlsx file, we will demo using the read_xlsx()
function.
We can expect that someone in the history of R and especially the
history of the readxl package has needed to skip lines at
the top of an Excel file before. So let’s look at the help pages
?read_xlsx: there is an argument called skip
that we can set to 4 to skip 4 lines.
## read in data
lobsters <- read_xlsx(here("data/lobsters.xlsx"), skip=4) # skip first 4 rows
Great. We’ve seen this data in Excel so I don’t feel the need to use
head() here like we’ve done before, but I do like having a
look at summary statistics and classes.
skimr::skimTo look at summary statistics we’ve used summary, which
is good for numeric columns, but it doesn’t give a lot of useful
information for non-numeric data. So it means it wouldn’t tell us how
many unique sites there are in this dataset. To have a look there I like
using the skimr package:
# explore data
skimr::skim(lobsters)
This skimr:: notation is a reminder to me that
skim is from the skimr package. It is a nice
convention: it’s a reminder to others (especially you!).
skim lets us look more at each variable. Here we can
look at our character variables and see that there are 5 unique sites
(in the n_unique output). Also, I particularly like looking
at missing data. There are 6 missing values in the size_mm
variable.
So now we have an idea of our data. But now we have a task: we’ve been asked by a colleague to report about how the average size of lobsters has changed for each site across time.
We will complete this task with R by using the dplyr
package for data wrangling, which we will do after demoing how this
would do it with pivot tables in Excel.
I will demo how we will make a pivot table with our lobster data. You are welcome to sit back and watch rather than following along.
First let’s summarize how many lobsters were counted each year. This means I want to count of rows by year.
So to do this in Excel we would initiate the Pivot Table Process:
Excel will ask what data I would like to include, and it will do its best to suggest coordinates for my data within the spreadsheet (it can have difficulty with non-rectangular or “non-tidy” data). It does a good job here of ignoring those top lines of data description.
It will also suggest we make our PivotTable in a new worksheet.
And then we’ll see our new sheet and a little wizard to help us create the PivotTable.
I want to start by summarizing by year, so I first drag the
year variable down into the “Rows” box. What I see at this
point are the years listed: this confirms that I’m going to group by
years.
And then, to summarize the counts for each year, I actually drag the
same year variable into the “Values” box. And it will
create a Pivot Table for me! But “sum” as the default summary statistic;
this doesn’t make a whole lot of sense for summarizing years. I can
click the little “I” icon to change this summary statistic to what I
want: Count of year.
A few things to note:
So pivot tables are great because they summarize the data and keep the raw data raw — they even promote good practice because they by default ask you if you’d like to present the data in a new sheet rather than in the same sheet.
We can include multiple variables in our PivotTable. If we want to add site as a second variable, we can drag it down:
But this is comparing sites within a year; we want to compare years within a site. We can reverse the order easily enough by dragging (you just have to remember to do all of these steps the next time you’d want to repeat this):
So in terms of our full task, which is to compare the average lobster size by site and year, we are on our way! I’ll leave this as a cliff-hanger here in Excel and we will carry forward in R.
Just to recap what we did here: we told Excel we wanted to group by
something (here: year and site) and then
summarize by something (here: count, not sum!)
group_by() %>% summarize()In R, we can create the functionality of pivot tables with the same logic: we will tell R to group by something and then summarize by something. Visually, it looks like this:
This graphic is from RStudio’s old-school data wrangling cheatsheet; all cheatsheets available from https://rstudio.com/resources/cheatsheets). It’s incredibly powerful to visualize what we are talking about with our data when do do these kinds of operations.
And in code, it looks like this:
data %>%
group_by() %>%
summarize()
It reads: “Take the data and then group by something and then summarize by something”.
The pipe operator %>% is a really critical feature of
the dplyr package, originally created for the
magrittr package. It lets us chain together steps of our
data wrangling, enabling us to tell a clear story about our entire data
analysis. This is not only a written story to archive what we’ve done,
but it will be a reproducible story that can be rerun and remixed. It is
not difficult to read as a human, and it is not a series of clicks to
remember. Let’s try it out!
group_by one variableLet’s use group_by() %>% summarize() with our
lobsters data, just like we did in Excel. We will first
group_by year and then summarize by count, using the function
n() (in the dplyr package). n()
counts the number of times an observation shows up, and since this is
uncounted data, this will count each row.
We can say this out loud while we write it: “take the lobsters data
and then group_by year and then summarize by count in a new column we’ll
call count_by_year”.
## # A tibble: 5 × 2
## year count_by_year
## <dbl> <int>
## 1 2012 231
## 2 2013 243
## 3 2014 510
## 4 2015 1100
## 5 2016 809
Notice how together, group_by and summarize
minimize the amount of information we see. We also saw this with the
pivot table. We lose the other columns that aren’t involved here.
Question: What if you don’t group_by first? Let’s try it and discuss what’s going on.
lobsters %>%
summarize(count = n())
## # A tibble: 1 × 1
## count
## <int>
## 1 2893
So if we don’t group_by first, we will get a single
summary statistic (sum in this case) for the whole dataset.
Another question: what if we only group_by?
lobsters %>%
group_by(year)
## # A tibble: 2,893 × 7
## # Groups: year [5]
## year month date site transect replicate size_mm
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 2012 8 8/20/12 ivee 3 A 70
## 2 2012 8 8/20/12 ivee 3 B 60
## 3 2012 8 8/20/12 ivee 3 B 65
## 4 2012 8 8/20/12 ivee 3 B 70
## 5 2012 8 8/20/12 ivee 3 B 85
## 6 2012 8 8/20/12 ivee 3 C 60
## 7 2012 8 8/20/12 ivee 3 C 65
## 8 2012 8 8/20/12 ivee 3 C 67
## 9 2012 8 8/20/12 ivee 3 D 70
## 10 2012 8 8/20/12 ivee 4 B 85
## # ℹ 2,883 more rows
R doesn’t summarize our data, but you can see from the output that it is indeed grouped. However, we haven’t done anything to the original data: we are only exploring. We are keeping the raw data raw.
To convince ourselves, let’s now check the lobsters
variable. We can do this by clicking on lobsters in the
Environment pane in RStudio.
We see that we haven’t changed any of our original data that was stored in this variable. (Just like how the pivot table didn’t affect the raw data on the original sheet).
Aside: You’ll also see that when you click on the variable name in the Environment pane,
View(lobsters)shows up in your Console.View()(capital V) is the R function to view any variable in the viewer. So this is something that you can write in your RMarkdown script, although RMarkdown will not be able to knit this view feature into the formatted document. So, if you want includeView()in your RMarkdown document you will need to either comment it out#View()or addeval=FALSEto the top of the code chunk so that the full line reads{r, eval=FALSE}.
group_by multiple variablesGreat. Now let’s summarize by both year and site like we did in the
pivot table. We are able to group_by more than one
variable. Let’s do this together:
## # A tibble: 25 × 3
## # Groups: site [5]
## site year count_by_siteyear
## <chr> <dbl> <int>
## 1 aque 2012 38
## 2 aque 2013 32
## 3 aque 2014 100
## 4 aque 2015 83
## 5 aque 2016 48
## 6 carp 2012 78
## 7 carp 2013 93
## 8 carp 2014 79
## 9 carp 2015 90
## 10 carp 2016 231
## # ℹ 15 more rows
We put the site first because that is what we want as an end product. But we could easily have put year first. We saw visually what would happen when we did this in the Pivot Table.
Great.
summarize multiple variablesWe can summarize multiple variables at a time.
So far we’ve summarized the count of lobster observations. Let’s also
calculate the mean and standard deviation. First let’s use the
mean() function to calculate the mean. We do this within
the same summarize() function, but we can add a new line to
make it easier to read. Notice how when you put your curser within the
parenthesis and hit return, the indentation will automatically
align.
## # A tibble: 25 × 4
## # Groups: site [5]
## site year count_by_siteyear mean_size_mm
## <chr> <dbl> <int> <dbl>
## 1 aque 2012 38 71
## 2 aque 2013 32 72.1
## 3 aque 2014 100 76.9
## 4 aque 2015 83 68.5
## 5 aque 2016 48 68.7
## 6 carp 2012 78 74.4
## 7 carp 2013 93 76.6
## 8 carp 2014 79 NA
## 9 carp 2015 90 70.7
## 10 carp 2016 231 68.9
## # ℹ 15 more rows
Aside Command-I will properly indent selected lines.
Great! But this will actually calculate some of the means as NA
because one or more values in that year are NA. So we can pass an
argument that says to remove NAs first before calculating the average.
Let’s do that, and then also calculate the standard deviation with the
sd() function:
## # A tibble: 25 × 5
## # Groups: site [5]
## site year count_by_siteyear mean_size_mm sd_size_mm
## <chr> <dbl> <int> <dbl> <dbl>
## 1 aque 2012 38 71 10.2
## 2 aque 2013 32 72.1 12.3
## 3 aque 2014 100 76.9 9.32
## 4 aque 2015 83 68.5 12.6
## 5 aque 2016 48 68.7 12.5
## 6 carp 2012 78 74.4 14.6
## 7 carp 2013 93 76.6 8.71
## 8 carp 2014 79 79.1 8.57
## 9 carp 2015 90 70.7 14.6
## 10 carp 2016 231 68.9 12.5
## # ℹ 15 more rows
So we can make the equivalent of Excel’s pivot table in R with
group_by() %>% summarize().
Now we are at the point where we actually want to save this summary information as a variable so we can use it in further analyses and formatting.
So let’s add a variable assignment to that first line:
kable()There are several options for formatting tables in RMarkdown; we’ll
show one here from the kableExtra package and learn more
about it tomorrow.
It works nicely with the pipe operator, so we can build do this from our new object:
## make a table with our new variable
siteyear_summary %>%
kable()
| site | year | count_by_siteyear | mean_size_mm | sd_size_mm |
|---|---|---|---|---|
| aque | 2012 | 38 | 71.00000 | 10.150223 |
| aque | 2013 | 32 | 72.12500 | 12.262584 |
| aque | 2014 | 100 | 76.92000 | 9.321074 |
| aque | 2015 | 83 | 68.45783 | 12.555536 |
| aque | 2016 | 48 | 68.68750 | 12.510687 |
| carp | 2012 | 78 | 74.35897 | 14.616282 |
| carp | 2013 | 93 | 76.56989 | 8.709562 |
| carp | 2014 | 79 | 79.08974 | 8.569329 |
| carp | 2015 | 90 | 70.65556 | 14.646517 |
| carp | 2016 | 231 | 68.90476 | 12.470122 |
| ivee | 2012 | 26 | 66.07692 | 12.092719 |
| ivee | 2013 | 40 | 73.77500 | 7.640941 |
| ivee | 2014 | 132 | 76.02273 | 17.860984 |
| ivee | 2015 | 361 | 69.80332 | 17.470534 |
| ivee | 2016 | 193 | 71.61658 | 13.450454 |
| mohk | 2012 | 83 | 77.25301 | 10.587433 |
| mohk | 2013 | 15 | 71.86667 | 10.190098 |
| mohk | 2014 | 36 | 75.75000 | 10.038142 |
| mohk | 2015 | 296 | 59.19932 | 16.770357 |
| mohk | 2016 | 210 | 63.01286 | 11.875763 |
| napl | 2012 | 6 | 73.00000 | 11.747340 |
| napl | 2013 | 63 | 75.31746 | 12.989854 |
| napl | 2014 | 163 | 79.51572 | 9.556531 |
| napl | 2015 | 270 | 78.24074 | 12.438899 |
| napl | 2016 | 127 | 74.39370 | 10.732060 |
Before we let you try this on your own, let’s go outside of our code chunk and write in Markdown.
I want to demo something that is a really powerful RMarkdown feature that we can already leverage with what we know in R.
Write this in Markdown but replace the # with a backtick (`): “There are #r nrow(lobsters)# total lobsters included in this report.” Let’s knit to see what happens.
I hope you can start to imagine the possibilities. If you wanted to write which year had the most observations, or which site had a decreasing trend, you would be able to.
size_mm
variable and function to calculate the median (Hint: ?median)Then, save, commit, and push your .Rmd, .html, and .png.
Solution (no peeking):
siteyear_summary <- lobsters %>%
group_by(site, year) %>%
summarize(count_by_siteyear = n(),
mean_size_mm = mean(size_mm, na.rm = TRUE),
sd_size_mm = sd(size_mm, na.rm = TRUE),
median_size_mm = median(size_mm, na.rm = TRUE))
## `summarise()` has grouped output by 'site'. You can override using the
## `.groups` argument.
## a ggplot option:
ggplot(data = siteyear_summary, aes(x = year, y = median_size_mm, color = site)) +
geom_line()
ggsave(here("figures", "lobsters-line.png"))
## Saving 7 x 5 in image
## another option:
ggplot(siteyear_summary, aes(x = year, y = median_size_mm)) +
geom_col() +
facet_wrap(~site)
ggsave(here("figures", "lobsters-col.png"))
## Saving 7 x 5 in image
Don’t forget to knit, commit, and push!
Nice work everybody.
Oh no! After all our analyses and everything we’ve done, our
colleague just emailed us at 4:30pm on Friday that he sent the wrong
data and we need to redo all our analyses with a new .xlsx file:
lobsters2.xlsx, not lobsters.xlsx. Aaaaah!
If we were doing this in Excel, this would be a bummer; we’d have to rebuild our pivot table and click through all of our logic again. And then export our figures and save them into our report.
But, since we did it in R, we are much safer. R’s power is not only in analytical power, but in automation and reproducibility.
This means we can go back to the top of our RMarkdown file, and read in this new data file, and then re-knit. We will still need to check that everything outputs correctly, (and that column headers haven’t been renamed), but our first pass will be to update the filename and re-knit:
## read in data
lobsters <- read_xlsx(here("data/lobsters2.xlsx"), skip=4)
And now we can see that our plot updated as well:
siteyear_summary <- lobsters %>%
group_by(site, year) %>%
summarize(count_by_siteyear = n(),
mean_size_mm = mean(size_mm, na.rm = TRUE),
sd_size_mm = sd(size_mm, na.rm = TRUE),
median_size_mm = median(size_mm, na.rm = TRUE), )
## `summarise()` has grouped output by 'site'. You can override using the
## `.groups` argument.
siteyear_summary
## a ggplot option:
ggplot(data = siteyear_summary, aes(x = year, y = median_size_mm, color = site)) +
geom_line()
ggsave(here("figures", "lobsters-line.png"))
## another option:
ggplot(siteyear_summary, aes(x = year, y = median_size_mm)) +
geom_col() +
facet_wrap(~site)
ggsave(here("figures", "lobsters-col.png"))
So cool.
dplyr::count()Now that we’ve spent time with group_by %>% summarize, there is a
shortcut if you only want to summarize by count. This is with a function
called count(), and it will group_by your selected
variable, count, and then also ungroup. It looks like this:
## # A tibble: 35 × 3
## site year n
## <chr> <dbl> <int>
## 1 aque 2012 38
## 2 aque 2013 32
## 3 aque 2014 100
## 4 aque 2015 83
## 5 aque 2016 48
## 6 aque 2017 67
## 7 aque 2018 54
## 8 carp 2012 78
## 9 carp 2013 93
## 10 carp 2014 79
## # ℹ 25 more rows
## # A tibble: 35 × 3
## site year n
## <chr> <dbl> <int>
## 1 aque 2012 38
## 2 aque 2013 32
## 3 aque 2014 100
## 4 aque 2015 83
## 5 aque 2016 48
## 6 aque 2017 67
## 7 aque 2018 54
## 8 carp 2012 78
## 9 carp 2013 93
## 10 carp 2014 79
## # ℹ 25 more rows
Hey, we could update our RMarkdown text knowing this: There are #r count(lobsters)# total lobsters included in this summary.
Switching gears…
mutate()There are a lot of times where you don’t want to summarize your data,
but you do want to operate beyond the original data. This is often done
by adding a column. We do this with the mutate() function
from dplyr. Let’s try this with our original lobsters data.
The sizes are in millimeters but let’s say it was important for them to
be in meters. We can add a column with this calculation:
lobsters %>%
mutate(size_m = size_mm / 1000)
If we want to add a column that has the same value repeated, we can
pass it just one value, either a number or a character string (in
quotes). And let’s save this as a variable called
lobsters_detailed
lobsters_detailed <- lobsters %>%
mutate(size_m = size_mm / 1000,
millenia = 2000,
observer = "Allison Horst")
select()We will end with one final function, select. This is how
to choose, retain, and move your data by columns:
Let’s say that we want to present this data finally with only columns for date, site, and size in meters. We would do this:
lobsters_detailed %>%
select(date, site, size_m)
One last time, let’s knit, save, commit, and push to GitHub.